Introduction to Open Data Science - Course Project

About the project

I began my PhD studies at the University of Helsinki this autumn, and I expect to be using R as part of my research project in musicology, employing a digital humanities approach. I’ve previously taken courses on statistics (including analysis in R) via the open universities at the University of Helsinki and the Hanken School of Economics. I recently got a new computer (and unfortunately lost most of my scripts from earlier courses), so I’m making a fresh start with R now.

Hitherto, I’ve most often processed my numbers in spreadsheets such as Excel and Google Sheets, and I’ve developed techniques to do my calculations quickly in these programs. The downside is that the threshold for using a spreadsheet has felt lower than for using R when processing numbers (and the quickest calculator is still Spotlight in macOS). So during this course, I hope to learn some useful techniques for making sense of data quickly. Perhaps RStudio might become my program of choice some day. :)

The address of my GitHub repository is http://github.com/wilhelmkvist/IODS-project.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Mon Dec  6 00:46:31 2021"

RStudio Exercise 2

Describe the work you have done this week and summarize your learning.

date()
## [1] "Mon Dec  6 00:46:31 2021"

1. Learning 2014 - an overview of the dataset

The dataset ‘Learning 2014’ contains data from a survey among students in the social sciences at the University of Helsinki. The data was collected during an introductory course in statistics in late 2014 and early 2015. Participation was voluntary, but students were strongly encouraged to take part, as they were rewarded with extra points in the final exam. Respondents (N=183) were presented with a set of questions designed to reflect their attitudes towards statistics and university studies in general. Learning approaches were measured with questions designed to reflect deep, surface, and strategic approaches. The present dataset (a subset of a larger dataset) contains 166 observations (by all those who participated in the final exam) of 7 variables. In addition to mean scores reflecting attitude and learning approaches, the total points in the final exam are provided. Background information was provided on the respondents’ age and gender (female/male).

setwd("~/Documents/IODS-project")
learning2014 <- 
read.table("data/learning2014.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

Below is an overview of the different variables. The course was clearly dominated by females, who account for two out of three students. Note the age span from 17 to 55 years (mean 25.5 years, median 22 years).

summary(learning2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

2. Exploring the dataset graphically

The chart below outlines attitude, learning approaches, and final exam score (points) by gender. Males generally scored higher on attitude and the deep learning approach, while females scored higher on the strategic and surface approaches. Final exam scores were roughly equal across genders, with only a few males outperforming females at the very top of the range. Although males were generally slightly older, age and final score were negatively rather than positively correlated for males (-0.24); for females the correlation was negligible (-0.02). The strongest correlation was between attitude and final score (0.44), about equal for males and females. There was a slight correlation between attitude and the deep learning approach (0.11), more so among males (0.17). However, there was no correlation between the deep learning approach and final scores (-0.01). Apart from attitude, a strategic learning approach was positively correlated with points (0.15), while a surface learning approach was negatively correlated with final scores (-0.14).

library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
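The correlation figures quoted above can be read from the ggpairs panels, but they can also be checked directly with cor(). A minimal sketch on made-up data (the real figures come from learning2014, which is loaded above):

```r
# Made-up stand-in for learning2014 (illustration only; the real values
# come from the survey data loaded above)
toy <- data.frame(
  gender = factor(c("F", "F", "M", "M", "F", "M")),
  age    = c(21, 25, 30, 22, 40, 35),
  points = c(24, 20, 18, 27, 15, 17)
)
# Overall Pearson correlation between age and points
cor(toy$age, toy$points)
# Per-gender correlations, matching the coloured ggpairs panels
sapply(split(toy, toy$gender), function(d) cor(d$age, d$points))
```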

3. Fitting a regression model

Based on the correlation between variables (analyzed above), I choose to fit a regression model in order to understand the impact of age, attitude and learning approach on exam points. As it turns out, the student’s attitude is highly significant (p < 0.001), while age and a strategic learning approach are moderately significant (p = 0.0981 and 0.0621 respectively).

options(scipen = 999) #to indicate the minimal p-values with zeros
model1 <- lm(points ~ age + attitude + stra, data = learning2014)
summary(model1)
## 
## Call:
## lm(formula = points ~ age + attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value      Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 0.00006171726 ***
## age         -0.08822    0.05302  -1.664        0.0981 .  
## attitude     3.48077    0.56220   6.191 0.00000000472 ***
## stra         1.00371    0.53434   1.878        0.0621 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 0.0000000107

For the sake of comparison, I make a second model. Excluding the moderately significant variables ‘age’ and ‘stra’ does not seem to make any big difference. (I have not conducted a formal hypothesis test between the models, but the intercept and attitude remain highly significant in both.)

model2 <- lm(points ~ attitude, data = learning2014)
summary(model2)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value      Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 0.00000000195 ***
## attitude      3.5255     0.5674   6.214 0.00000000412 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 0.000000004119
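A formal comparison of two nested models can be done with an F-test via anova(). A minimal sketch on simulated data, since I did not run the test above (with the real data one would call anova with the attitude-only and the full model):

```r
set.seed(1)
# Simulated stand-in for learning2014 (structure only; values are random)
toy <- data.frame(
  age      = sample(17:55, 50, replace = TRUE),
  attitude = runif(50, 1, 5),
  stra     = runif(50, 1, 5)
)
toy$points <- 11 + 3.5 * toy$attitude + rnorm(50, sd = 5)
small <- lm(points ~ attitude, data = toy)
full  <- lm(points ~ age + attitude + stra, data = toy)
# F-test: do age and stra jointly improve on the attitude-only model?
anova(small, full)
```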

4. Predicting exam scores

I choose to go forth with the first model (model1). According to the model, exam scores can be predicted if we know the person’s age, attitude and scores measuring their strategic learning approach. As stated earlier, age is slightly negatively correlated with exam score, while attitude is to a much larger degree positively correlated with exam score. A strategic learning approach is moderately correlated with exam scores.

Let us make two example calculations: one for a moderately motivated young undergraduate (“Mike”), another for a highly motivated late bloomer (“Ritva”). Mike, age 22, scores 2 out of 5 on attitude and strategic learning. Ritva, age 55, scores 5 out of 5 on attitude and strategic learning. We can now predict their exam scores:

students <- c('Mike','Ritva')
age <- c(22, 55)
attitude <- c(2,5)
stra <- c(2,5)
new_data <- data.frame(students, age, attitude, stra)
predict(model1, newdata = new_data)
##        1        2 
## 17.92358 28.46580

Effectively, R here makes the following calculations using these coefficients:

model1$coefficients
## (Intercept)         age    attitude        stra 
## 10.89542805 -0.08821869  3.48076791  1.00371238

Calculating predictions using the coefficients can be done in the following manner. (Differences in decimals are due to rounding errors.)

#Mike:
10.89543+(-0.08822*22)+(3.48077*2)+(1.00371*2)
## [1] 17.92355
#Ritva:
10.89543+(-0.08822*55)+(3.48077*5)+(1.00371*5)
## [1] 28.46573
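The same hand calculation can be done without typing rounded numbers, by multiplying a small design matrix with the coefficient vector (the coefficients below are copied from the model1 printout above):

```r
# Coefficients as printed by model1$coefficients above
b <- c(10.89542805, -0.08821869, 3.48076791, 1.00371238)
# Rows: Mike (age 22, attitude 2, stra 2) and Ritva (age 55, attitude 5, stra 5);
# the leading 1 multiplies the intercept
X <- rbind(Mike = c(1, 22, 2, 2), Ritva = c(1, 55, 5, 5))
drop(X %*% b)
```

This reproduces the predict() output above without rounding errors.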

The R-squared value (0.2182) summarizes how well the regression line fits the data. The model makes a fair fit, explaining about 22 per cent of the sample variance. (Put simply, about a fifth of the variation in exam scores is explained by age, attitude, and strategic learning approach combined.)

summary(model1)$r.squared
## [1] 0.218172
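The R-squared value itself is simply 1 - RSS/TSS; a self-contained sketch on made-up numbers:

```r
# Made-up exam points and attitude scores for illustration only
y <- c(25, 12, 24, 10, 22, 21, 31, 24)
x <- c(3.7, 3.1, 2.5, 3.5, 3.7, 3.8, 2.9, 3.8)
fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)   # variation left unexplained by the model
tss <- sum((y - mean(y))^2)    # total variation around the mean
r2  <- 1 - rss / tss
all.equal(r2, summary(fit)$r.squared)
```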

5. Graphical model validation

In the following, I will produce three diagnostic plots in order to graphically explore the validity of my model assumptions. The residuals vs fitted values plot checks the assumption of constant error variance, the QQ plot shows whether the errors are normally distributed, and the residuals vs leverage plot reveals whether any single observation has an outsized influence on the model. In this case, the errors seem to be normally distributed, uncorrelated, and to have a constant variance σ². No severe outliers are identified.

par(mfrow = c(2,2))
plot(model1, which = c(1,2,5))


RStudio Exercise 3

Describe the work you have done this week and summarize your learning.

date()
## [1] "Mon Dec  6 00:46:41 2021"

1. Exploring alcohol consumption among Portuguese secondary school students - an overview of the Alc dataset

The Alc dataset provides data on alcohol consumption among Portuguese secondary school students aged 15 to 22 (mean 16.6 years). The data was collected using school reports and questionnaires. Attributes include student grades, demographic, social and school related features. The Alc dataset was constructed by joining two separate sets on student performance in two distinct subjects: Mathematics and Portuguese. As some students were known to have appeared in both sets, duplicates were identified and removed.

The Alc dataset features a total of 370 observations of 35 variables (listed below). Note: alc_use was calculated as the mean of workday (Dalc) and weekend (Walc) alcohol consumption; high_use is a binary variable indicating whether the student drinks more than 2 doses per day on average. The data was used in a paper by Cortez and Silva (2008). More information on the dataset, along with attribute descriptions, can be found at the UCI Machine Learning Repository website.
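How alc_use and high_use are derived can be sketched as follows (the Dalc and Walc values here are made up; the actual derivation was done in the data wrangling script):

```r
# Made-up workday (Dalc) and weekend (Walc) consumption scores
toy <- data.frame(Dalc = c(1, 3, 5), Walc = c(2, 4, 5))
toy$alc_use  <- (toy$Dalc + toy$Walc) / 2  # mean of workday and weekend use
toy$high_use <- toy$alc_use > 2            # TRUE when average use exceeds 2
toy
```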

alc <- read.table("data/alc.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

There are a number of variables that can be studied to understand student alcohol consumption. To get a quick overview of the content and distribution of each variable we can use the following code:

library(tidyr); library(dplyr); library(ggplot2)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped

In order to find out which variables could possibly be statistically relevant, I begin by running a regression model explaining “alc_use” with all other variables (except for “Dalc”, “Walc” and “high_use” which are directly related to “alc_use”).

# exclude variables "Dalc", "Walc", "high_use", which are directly related to "alc_use"
selvarnames <- names(alc) %in% c("Dalc","Walc","high_use")
alc2 <- alc[!selvarnames]
# create a formula where "alc_use" is explained by the selected variables (stored in alc2)
fmla <- as.formula(paste("alc_use~", paste(names(alc2)[1:31], collapse="+")))
# create linear regression model and read the summary
model1 <- lm(data=alc2, formula = fmla)
summary(model1)
## 
## Call:
## lm(formula = fmla, data = alc2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.65003 -0.49853 -0.08492  0.38197  2.87986 
## 
## Coefficients:
##                   Estimate Std. Error t value        Pr(>|t|)    
## (Intercept)      -0.308078   0.928209  -0.332        0.740172    
## schoolMS         -0.206366   0.166819  -1.237        0.216943    
## sexM              0.453517   0.098064   4.625 0.0000053993241 ***
## age               0.067109   0.043688   1.536        0.125478    
## addressU         -0.232701   0.119806  -1.942        0.052951 .  
## famsizeLE3        0.159470   0.099574   1.602        0.110221    
## PstatusT          0.016944   0.148389   0.114        0.909161    
## Medu              0.021611   0.065772   0.329        0.742688    
## Fedu              0.055074   0.056854   0.969        0.333409    
## Mjobhealth       -0.252106   0.227986  -1.106        0.269622    
## Mjobother        -0.111991   0.146597  -0.764        0.445450    
## Mjobservices     -0.135048   0.166597  -0.811        0.418166    
## Mjobteacher      -0.066798   0.210155  -0.318        0.750799    
## Fjobhealth        0.066192   0.305988   0.216        0.828870    
## Fjobother         0.179164   0.224244   0.799        0.424887    
## Fjobservices      0.399310   0.231379   1.726        0.085325 .  
## Fjobteacher      -0.115857   0.271994  -0.426        0.670421    
## reasonhome        0.045740   0.113435   0.403        0.687044    
## reasonother       0.409977   0.164654   2.490        0.013270 *  
## reasonreputation  0.056384   0.117446   0.480        0.631488    
## guardianmother   -0.149224   0.108597  -1.374        0.170344    
## guardianother    -0.232964   0.235168  -0.991        0.322593    
## traveltime        0.152445   0.069143   2.205        0.028161 *  
## studytime        -0.115010   0.058181  -1.977        0.048903 *  
## schoolsupyes      0.021980   0.140610   0.156        0.875877    
## famsupyes        -0.057759   0.097938  -0.590        0.555763    
## activitiesyes    -0.185204   0.090385  -2.049        0.041249 *  
## nurseryyes       -0.261139   0.112122  -2.329        0.020461 *  
## higheryes         0.237373   0.241308   0.984        0.325989    
## internetyes       0.042568   0.131414   0.324        0.746201    
## romanticyes       0.052334   0.097849   0.535        0.593119    
## famrel           -0.178569   0.048777  -3.661        0.000293 ***
## freetime          0.069539   0.048266   1.441        0.150606    
## goout             0.293976   0.042174   6.971 0.0000000000173 ***
## health            0.056055   0.032523   1.724        0.085729 .  
## failures          0.169898   0.089496   1.898        0.058521 .  
## paidyes           0.287764   0.095692   3.007        0.002840 ** 
## absences          0.028011   0.008503   3.294        0.001094 ** 
## G1               -0.057193   0.037889  -1.509        0.132136    
## G2                0.042812   0.047726   0.897        0.370356    
## G3                0.006646   0.034650   0.192        0.848011    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8055 on 329 degrees of freedom
## Multiple R-squared:  0.4191, Adjusted R-squared:  0.3485 
## F-statistic: 5.935 on 40 and 329 DF,  p-value: < 0.00000000000000022

From the regression model summary, I can read that male sex is a highly significant variable as well as the quality of family relationships and going out with friends. The variables “paid” (extra paid classes in Math or Portuguese) and “absences” (number of school absences) are fairly significant. Moderately significant variables include study time, travel time, whether the student attended nursery school and whether the student has taken part in extra-curricular activities.

In order to understand the relationship between high alcohol consumption and selected variables, I choose to visualize the relationship between high alcohol consumption and sex, family relationships, going out, and study time. For the analysis, I construct a subset comprising columns 2 (sex), 22 (family relationship), 24 (going out), 14 (study time) and 35 (high_use). The visualization is presented in sex-disaggregated form.

library(ggplot2)
library(GGally)
#construct subset comprising columns 2 (sex), 22 (family relationship), 24 (going out), 14 (study time) and 35 (high_use)
alcpairs <- alc[, c(2, 22,24,14,35)]
#plot the pairs
ggpairs(alcpairs, mapping = aes(col=sex, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

From the visual presentation, I can formulate the following working hypotheses for the select variables:

  • sex: men are more likely than women to be high consumers
  • famrel: students with high alcohol consumption generally score lower on the quality of family relationships
  • goout: high alcohol consumption is strongly related to the frequency of going out, especially for men
  • study time: students with high alcohol consumption generally study fewer hours

The table below shows a summary of key statistics for heavy and moderate drinkers (high_use = TRUE/FALSE), looking at four variables (three numeric plus sex).

alc %>% group_by(sex, high_use) %>% summarise(count = n(), famrel=mean(famrel), goout=mean(goout), studytime=mean(studytime))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 6
## # Groups:   sex [2]
##   sex   high_use count famrel goout studytime
##   <fct> <lgl>    <int>  <dbl> <dbl>     <dbl>
## 1 F     FALSE      154   3.92  2.95      2.34
## 2 F     TRUE        41   3.73  3.39      2   
## 3 M     FALSE      105   4.13  2.70      1.90
## 4 M     TRUE        70   3.79  3.93      1.63

The table above confirms what has been stated previously: Portuguese young heavy drinkers generally go out more than their moderately drinking peers. They also spend less time studying, and they rate their family relations as worse.

To illustrate the relationship between family relations and heavy drinking, let’s draw a separate box plot. As the chart indicates, moderate consumers of alcohol have better family ties. The difference between the groups is not, however, as big as the boxplot visualization would initially indicate. Therefore, red dots have been added to indicate mean values by group and sex.

The boxplot visualization indicates that students with high alcohol consumption score lower on the quality of family relationships, but the difference in mean values between groups is much lower than the visualization focused on integer values indicates: the difference is only 0.34 score points for males and 0.19 for females. In this respect, the boxplot visualization is easily misleading.

NB! 1) In both groups there are outliers scoring very low on family relations. In general, however, respondents have scored fairly high (median=4). NB! 2) The figure says little about whether the quality of family relations are the cause or consequence of high alcohol consumption.

# initialise a plot of high_use and famrel
g <- ggplot(alc, aes(x = high_use, y = famrel, col = sex))
g + geom_boxplot() + ylab("family relation") + ggtitle("Quality of family relationships \n by alcohol consumption and sex") + theme(plot.title = element_text(hjust = 0.5)) + stat_summary(fun=mean)
## Warning: Removed 4 rows containing missing values (geom_segment).

We can justify the claim of a misleading visualization by adjusting the code for the previous table to also report median values. Doing this shows that there is no difference in median family relation values between groups and sexes. Reporting medians also reveals a difference in the “goout” variable, with heavy drinkers more inclined to go out. (The median study time remains the same, 4.)

alc %>% group_by(sex, high_use) %>% summarise(count = n(), famrel_mean=mean(famrel), famrel_median=median(famrel), goout_mean=mean(goout), goout_median=median(goout))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 × 7
## # Groups:   sex [2]
##   sex   high_use count famrel_mean famrel_median goout_mean goout_median
##   <fct> <lgl>    <int>       <dbl>         <dbl>      <dbl>        <dbl>
## 1 F     FALSE      154        3.92             4       2.95            3
## 2 F     TRUE        41        3.73             4       3.39            4
## 3 M     FALSE      105        4.13             4       2.70            3
## 4 M     TRUE        70        3.79             4       3.93            4

Applying logistic regression to understand the relationship between high alcohol usage and select variables

In the logistic regression model, the computational target variable is the log of odds. From this it follows that applying the exponent function to the fitted values gives us the odds. That is, the exponents of the coefficients can be interpreted as odds ratios between a unit change (vs no change) in the corresponding explanatory variable (according to the course video on Data Camp).
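The relationship between log-odds, odds ratios, and probabilities can be sketched on simulated data (the toy model and numbers below are made up and only stand in for the model fitted further down):

```r
set.seed(2)
# Simulated binary outcome with a known positive effect of x
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + 1.2 * x))
m_toy <- glm(y ~ x, family = "binomial")
# Exponentiating a coefficient turns a log-odds into an odds ratio
exp(coef(m_toy)["x"])
# A fitted probability is the inverse logit of the linear predictor
lp <- coef(m_toy)[1] + coef(m_toy)[2] * 1  # linear predictor at x = 1
plogis(lp)                                 # same as predict(..., type = "response")
```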

From this we see that, with an odds ratio of over 2, males are more than twice as likely to have a “success” (that is, to be heavy drinkers) as their female peers when controlling for family relations, going out, and study time. The same goes for students who spend much time going out. On the other hand, students with good family relations and long study hours are less likely to become heavy drinkers; their odds ratios are only about 2/3.

The confidence intervals presented in the two rightmost columns (2.5 % and 97.5 %) of the table below give an indication of the uncertainty of the estimates. We see, for instance, that the interval for males is wider than that for going out: although the two odds ratios are about the same, the male effect is estimated with less precision.

The data presented here largely supports my initial working hypotheses about men being more likely than women to be high consumers, about students with high alcohol consumption generally scoring lower on the quality of family relationships, about high alcohol consumption being related to the frequency of going out and about lower study times being related to higher alcohol consumption.

However, revisiting the hypotheses reveals a somewhat unfortunate wording and approach, focusing too much on the faults of those I have called heavy drinkers (those with a positive high_use value). It would perhaps be fairer to comment on the relationship between high daily alcohol doses and background factors, and to avoid judgmental attitudes.

Below is the printout of the summary of the logistic regression model and the table with odds ratios and confidence intervals.

m <- glm(high_use ~ sex + famrel + goout + studytime, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ sex + famrel + goout + studytime, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5891  -0.7629  -0.4976   0.8304   2.6937  
## 
## Coefficients:
##             Estimate Std. Error z value       Pr(>|z|)    
## (Intercept)  -1.2672     0.7312  -1.733        0.08309 .  
## sexM          0.7959     0.2669   2.982        0.00286 ** 
## famrel       -0.4193     0.1399  -2.996        0.00273 ** 
## goout         0.7873     0.1230   6.399 0.000000000157 ***
## studytime    -0.4814     0.1711  -2.814        0.00490 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 370.69  on 365  degrees of freedom
## AIC: 380.69
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- exp(confint(m))
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.2816141 0.06545486 1.1622100
## sexM        2.2165443 1.31841735 3.7630180
## famrel      0.6575181 0.49791884 0.8636137
## goout       2.1974324 1.73873662 2.8198312
## studytime   0.6179355 0.43751933 0.8576353

Due to the variable selection method at the start of this exercise, all of the variables controlled for were found to be statistically significant: one at the 0.1 per cent level and three at the 1 per cent level. I can therefore explore the predictive power of the model as such.

# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to "alc"
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probabilities > 0.5)
# tabulate the target variable versus the predictions. The numbers are the count of individuals.
table(high_use = alc$high_use, prediction = alc$prediction) 
##         prediction
## high_use FALSE TRUE
##    FALSE   230   29
##    TRUE     59   52
# tabulate the target variable versus the predictions. The table shows the proportions.
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.62162162 0.07837838 0.70000000
##    TRUE  0.15945946 0.14054054 0.30000000
##    Sum   0.78108108 0.21891892 1.00000000

Based on the data and the reported alcohol intake, exactly 30 per cent of respondents (111 of 370) were classified as heavy drinkers (high_use = TRUE). Correspondingly, 70 per cent (259 individuals) were classified as non-heavy drinkers (high_use = FALSE). According to the model, 29 of the 259 moderately drinking individuals (11 per cent) were erroneously predicted to be heavy drinkers. Of the 111 heavy drinkers, a majority (59) were erroneously predicted not to be high users although they were according to the data.

Although the model did fairly well in recognizing non-heavy users, it did worse at identifying actual heavy drinkers (based on their reported intake). Overall, the model predicted 78 per cent of respondents to be non-heavy users, while the actual proportion was 70 per cent.
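These observations can be condensed into standard classification metrics; a sketch using the counts from the cross-tabulation above (230, 29, 59, 52):

```r
# Confusion matrix counts copied from the table above
tab <- matrix(c(230, 59, 29, 52), nrow = 2,
              dimnames = list(high_use = c("FALSE", "TRUE"),
                              prediction = c("FALSE", "TRUE")))
accuracy    <- sum(diag(tab)) / sum(tab)                    # share classified correctly
sensitivity <- tab["TRUE", "TRUE"] / sum(tab["TRUE", ])     # heavy drinkers identified
specificity <- tab["FALSE", "FALSE"] / sum(tab["FALSE", ])  # moderates identified
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

Note that 1 - accuracy (88/370, about 0.238) is exactly the training error that the loss function computes.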

The proportion of wrongly classified individuals is displayed visually in the plot below. As the visualization makes clear, the proportion of inaccurately classified individuals (i.e. the training error) is fairly high. The mean prediction error can be computed by defining a loss function and comparing classifications and probabilities. The calculation indicates that nearly 24 per cent are wrongly classified. I would not have expected the analysis to give such a high number. On the other hand, the model could become more accurate with a higher number of observations.

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2378378

A note on the usefulness of the model

At a loss rate of 24 per cent, the model does not seem very accurate. Can I use the information? It depends on the purpose. For instance, if the goal were to target alcohol drinkers and feed them advertisements on social media (assuming that heavy drinkers would be more inclined to buy booze), the information could definitely be useful (if one wanted to target alcohol ads at minors…). For identifying who will become an alcoholic in five years and who would need interventions, the model is far too inaccurate. At this level of accuracy, one could perhaps only target information campaigns on the potential harms of drinking at an early age.

Bonus: Performing ten-fold cross-validation on the model

With ten-fold cross-validation, the prediction error seems to be larger than when using the loss function (with a difference of about one percentage point).

# Performing ten-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2432432
# calculating the difference between using ten-fold cross-validation and a loss function
cv$delta[1]-loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.005405405

Super-Bonus: Performing cross-validation to compare the performance of different logistic regression models

Finally, I will perform cross-validation to compare the performance of different logistic regression models using different sets of predictors. I start with the maximum number of predictors and explore the changes in the training and testing errors as I drop variables one by one. The original dataset contained 33 variables, but since high_use is directly related to Walc and Dalc, I choose to exclude these for computational feasibility.

In practice, I begin with creating a vector with running numbers in decreasing order from 31 to 1 indicating how many variables I will test at a time. I then create two more empty vectors, where I will save the results from the computational exercise. The three vectors will be used to construct the resulting data frame.

I then write a for loop to construct as many formulas as there are variables at any given time. I begin by running the cross-validation with the largest number of variables (31). The results are saved in a data frame along with the number of variables used. The resulting table has three columns, indicating the number of variables, the prediction error rate, and the training error rate, and 31 rows, one for every run.

Finally, I construct a plot indicating how both prediction and training errors decrease as the number of variables increases, with prediction errors always being greater. Looking at the resulting plot, I initially found it tough to digest the great variance and especially the initially increasing trend in prediction errors. But I guess that might have been a result of a varying number of statistically significant variables being used in the different models.

The downside of the fairly long code written below is the time it takes to execute. I have found that executing this last chunk alone, comparing different models with up to 31 variables, takes about three minutes. The time needed makes me reluctant to run the code and test small changes unless I have reduced the number of variables to a handful. If you have any suggestions on how to improve the code and speed up the process, I would greatly appreciate them!
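On the speed question: one low-effort change, sketched below on the built-in mtcars data (a stand-in, not the alc data), is to swap leave-one-out validation (K = nrow) for 10-fold cross-validation, which fits only 10 models per call instead of one per observation. This is only a sketch; the two schemes give slightly different error estimates.

```r
library(boot)

# proportion of wrong classifications (same loss function as above)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# illustrative logistic regression on built-in data (mtcars stands in for alc)
m <- glm(am ~ mpg, data = mtcars, family = "binomial")

# leave-one-out: one model fit per observation (32 fits here)
cv_loo <- cv.glm(data = mtcars, cost = loss_func, glmfit = m, K = nrow(mtcars))

# 10-fold: only 10 fits per call, typically much faster on larger data
set.seed(123) # the fold assignment is random
cv_10 <- cv.glm(data = mtcars, cost = loss_func, glmfit = m, K = 10)

c(loo = cv_loo$delta[1], tenfold = cv_10$delta[1])
```

With K = 10, each iteration of the super-bonus loop would need only 10 model fits regardless of the number of observations, at the cost of a small random component in the error estimate.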

library(dplyr)
library(boot)
howmanyvar <- 31 #Enter here how many variables you want to test for. 31 is the maximum. (33 variables were included in the original dataset but two of these will always be excluded for computational feasibility and avoidance of near-perfect correlation.)
#create vector, in sequence, starting from the number above, descending by 1.
v <- rev(seq(1,howmanyvar))
#create empty numeric vectors of the same length for the results.
trainingerrors <- numeric(howmanyvar)
predictionerrors <- numeric(howmanyvar)
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
for(i in v) {
  # Within the for loop, I first create a temporary subset called "alctest". I exclude the variables "Dalc" and "Walc", as these were found to cause warnings when fitting the logistic regression model and computing probabilities (they are highly correlated with "high_use"). I also exclude "probability" and "prediction", as the probability has been calculated based on the whole dataset and will now be replaced.
  exclvarnames <- names(alc) %in% c("Dalc", "Walc", "alc_use", "probability", "prediction")
  alctest <- alc[!exclvarnames]
  # From the alctest subset I gradually drop columns one at a time, starting from the number of variables entered in "howmanyvar". In alctest, 32 is the index of the variable high_use, which I always want to include.
  alctest <- dplyr::select(alctest, 1:v[i], 32)
  # However, I exclude "high_use" from the vector of column names used on the right-hand side of the formula.
  fnames <- names(alctest)[names(alctest) != "high_use"]
  # create a formula where "high_use" is explained by the variables
  f <- as.formula(paste("high_use~", paste(fnames, collapse = "+")))
  # fit a logistic regression
  m2 <- glm(f, data = alctest, family = "binomial")
  # predict() the probability of high_use using model m2
  probabilities <- predict(m2, type = "response")
  # add the predicted probabilities to "alctest"
  alctest <- mutate(alctest, probability = probabilities)
  # compute the proportion of wrong predictions in the (training) data and save it
  trainingerrors[i] <- loss_func(alctest$high_use, alctest$probability)
  # leave-one-out cross-validation (K equals the number of observations)
  cv <- cv.glm(data = alctest, cost = loss_func, glmfit = m2, K = nrow(alctest))
  # save the cross-validation (prediction) error
  predictionerrors[i] <- cv$delta[1]
}
results <- data.frame(variables=v, trainingerrors=trainingerrors, predictionerrors=predictionerrors)
results
##    variables trainingerrors predictionerrors
## 1         31      0.1918919        0.2567568
## 2         30      0.1891892        0.2621622
## 3         29      0.1945946        0.2540541
## 4         28      0.1972973        0.2540541
## 5         27      0.2162162        0.2783784
## 6         26      0.2108108        0.2621622
## 7         25      0.2162162        0.2702703
## 8         24      0.2162162        0.2729730
## 9         23      0.2675676        0.3162162
## 10        22      0.2594595        0.3270270
## 11        21      0.2729730        0.3216216
## 12        20      0.2675676        0.3297297
## 13        19      0.2702703        0.3297297
## 14        18      0.2783784        0.3270270
## 15        17      0.2810811        0.3162162
## 16        16      0.2783784        0.3162162
## 17        15      0.2810811        0.3081081
## 18        14      0.2810811        0.3081081
## 19        13      0.2729730        0.3135135
## 20        12      0.2783784        0.3162162
## 21        11      0.2729730        0.3324324
## 22        10      0.2918919        0.3189189
## 23         9      0.2864865        0.3108108
## 24         8      0.2918919        0.3162162
## 25         7      0.2945946        0.3054054
## 26         6      0.2945946        0.3054054
## 27         5      0.2918919        0.3027027
## 28         4      0.2972973        0.3135135
## 29         3      0.3054054        0.3135135
## 30         2      0.3000000        0.3000000
## 31         1      0.3000000        0.3000000
p <- ggplot(results, aes(x=variables)) + geom_line(aes(y=predictionerrors, color="prediction")) + geom_line(aes(y=trainingerrors, color="training"))
p + ggtitle("Relation between error rates\n and number of variables") + theme(plot.title = element_text(hjust = 0.5)) + xlab("Number of variables") + ylab("Error rate") + scale_color_discrete(name="Type of error")


RStudio Exercise 4

date()
## [1] "Mon Dec  6 00:50:19 2021"

The Boston dataset - a brief description

The Boston dataset, included in the MASS package, contains 506 observations of 14 variables relating to housing in the Boston region. Each observation describes a Boston suburb or town. The variables include information such as crime rate, air pollution, ethnic composition, proportion of land for large properties or industry, taxation, distances, accessibility and pricing. As has been stated elsewhere, the dataset has become a much-used pedagogical tool for teaching regression analysis and machine learning. First, let us swiftly explore its contents!

# accessing the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# loading the Boston data
data("Boston")
# typing ?Boston will open the documentation on the variables in the R console. A description can also be found [here] (https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html).
?Boston
# a summary of the content of the variables is given below
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

It is perhaps hard to form an understanding of the relationships between the 14 variables simply by mapping pairs. Drawing a corrplot diagram gives a quick overview of the correlations between variables. The larger the circle, the stronger the correlation; red indicates negative correlation, blue positive.

# plotting the relation between variables with corrplot
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
# rounding values
cor_matrix<-cor(Boston) %>% round(2)
# drawing a corrplot
corrplot(cor_matrix, method="circle", type = "upper", tl.cex = 0.6, tl.pos = "d")

# contrary to the instructions provided on Data Camp, I choose to exclude the 'cl.pos = "b"' command, as I prefer to have the color legend vertically on the right rather than horizontally below the figure.

What sticks out from the visualization above is the strong negative correlation between the distance to employment centres (dis) and the proportion of old houses (age), nitrogen oxides concentration in the air (nox) and the proportion of non-retail businesses (indus). That is, the further away we are from employment centres, the higher the proportion of new houses, the lower the share of industrial properties and the lower the concentration of nitrogen oxides in the air.

Similarly, there is a strong positive correlation between accessibility to highways (rad) and property-tax rate (tax). That is, the better the access to highways, the higher the property tax rate.

In the following section, I will standardize the variables, which allows me to compare them more easily and perform additional operations on them. (Ideally, I would explore the differences between the standardized and non-standardized variables graphically, for instance using the ggplot bar graph technique described here, but as this was not explicitly required, I suspect it would be beyond the scope of this exercise.) For now, it will suffice to print out a summary of the variables in the standardized set.

Comparing the summaries, it can be seen that the range of several variables has decreased significantly (e.g. crim, zn, indus). In fact, the maximum value of crim (9.92) is the largest value among all standardized variables. As all variables have now been centered around a mean of zero, the minimum values of all variables have become negative.

# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# summarize the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
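As a quick sanity check on what scale() does, here is a minimal base-R sketch (on a made-up vector, independent of the Boston data) showing that it subtracts the mean and divides by the standard deviation, column-wise:

```r
x <- c(2, 4, 6, 8, 10)

# manual standardization: (x - mean) / sd
manual <- (x - mean(x)) / sd(x)

# scale() does the same and returns a matrix, so flatten it for comparison
scaled <- as.vector(scale(x))

all.equal(manual, scaled)               # the two agree
c(mean = mean(scaled), sd = sd(scaled)) # mean 0, sd 1
```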

As instructed, I continue by creating a categorical variable for the crime rate using the quantiles as break points. I choose to name the categories from low to high. I then drop the old crime rate variable from the boston_scaled dataset.

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low","med_low","med_high","high"))
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

Moreover, I divide the dataset into train and test sets so that 80% of the data belongs to the train set. This is done as a means of preparation for fitting a linear discriminant analysis model on the train set.

# dividing the dataset into train and test sets, so that 80% of the data belongs to the train set. I randomly assign 80 per cent of the rows in the Boston dataset to the train set.
n <- nrow(Boston)
ind <- sample(n,  size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set 
test <- boston_scaled[-ind,]

Now that I have created the train and test sets, I continue by fitting the linear discriminant analysis on the train set. I choose the categorical crime rate (crime) as the target variable and all other variables as predictors. I will also plot the LDA fit. The chunk below includes code defining a function for enriching the plot with arrows.

# fitting the linear discriminant analysis on the train set using the categorical crime rate as target variable and all other variables in the dataset as predictor variables.
lda.fit <- lda(crime ~ ., data = train)
# this function can be used for adding arrows to the biplot that will be plotted next
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plotting the lda results with arrows added
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 2)

The image drawn above gives an indication of the direction in which the various variables pull the results. The longest arrow is clearly that of the rad variable, with zn and nox also pulling the results fairly strongly in different directions.

Continuing to follow the given instructions, I save the crime categories from the test data before removing the crime variable. This allows me to evaluate the correctness of predictions when using the test data to predict crime classifications. The cross tabulation of predictions and correct results is given below.

The results of the cross tabulation are interesting, especially looking at the four corners. None of the areas where the crime rate was predicted low actually had high crime rates, and vice versa. And similarly: where crime rates were predicted high, they actually were high, and where they were predicted low, they chiefly were low. Some variation occurred nevertheless in how accurate the predictions were. For instance, of the areas with low rates, only about half were correctly classified as such (most of the rest were classified as med_low, with a few instances of med_high). It seems as if the model did a better job of getting areas with high crime rates right than areas with low rates.

# saving crime categories from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predicting classes with the test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulating the results with the crime categories from the test set (removed from the test set)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       12      10        3    0
##   med_low    6      14        4    0
##   med_high   0       9       20    0
##   high       0       0        1   23
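To put a number on that impression, the overall accuracy is the share of observations on the diagonal of the table. A small base-R sketch using the counts printed above (note that these counts vary between runs, since the train/test split is random):

```r
# counts copied from the cross tabulation above
tab <- matrix(c(12, 10,  3,  0,
                 6, 14,  4,  0,
                 0,  9, 20,  0,
                 0,  0,  1, 23),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))

# overall accuracy: correctly classified observations over all observations
sum(diag(tab)) / sum(tab) # 69/102, about 0.68

# accuracy within each true class: high-crime areas are classified best
round(diag(tab) / rowSums(tab), 2)
```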

I will now continue with the last step of the exercise proper. I begin by reloading the Boston dataset and standardizing the variables. I then prepare a Euclidean distance matrix to calculate the distances between observations, and run a k-means algorithm on the dataset to let the computer sort the data into clusters. The clusters generated by the k-means algorithm are plotted below.

# reloading the Boston dataset
library(MASS)
data("Boston")
# standardizing the dataset
Boston <- scale(Boston)
# preparing a euclidean distance matrix
dist_eu <- dist(Boston)
# running a k-means algorithm on the dataset
km <-kmeans(Boston, centers = 3)
# plotting the results
pairs(Boston, col = km$cluster)

As can be seen, the chart above looks really busy. We can determine the optimal number of clusters by plotting the total within-cluster sum of squares for k-means runs with 1 to 10 centers. The result is given below.

The optimal number of clusters is indicated by a sharp drop in the curve. In this case, there is no self-evident answer. In my interpretation, the drop is at its sharpest at two cluster centers, after which the decline slows down. I therefore find it reasonable to cluster using two centers. A new plot is printed below.

# calculating the total within sum of squares (up to 10 clusters)
set.seed(123) # setting the seed makes the random k-means runs below reproducible
twcss <- sapply(1:10, function(k){kmeans(Boston, k)$tot.withinss})
# visualizing the results
library(ggplot2)
qplot(x = 1:10, y = twcss, geom = 'line') + scale_x_continuous(breaks = c(2,4,6,8,10), limits=c(1,10))

# running again the k-means algorithm on the dataset with the newly determined optimal number of clusters
km <-kmeans(Boston, centers = 2)
# plotting the results with pairs
pairs(Boston, col = km$cluster)

For the bonus section, I run the k-means algorithm on the original (standardized) Boston data with 3 cluster centers. An LDA is then performed with the clusters as target classes. The biplot with arrows is given below. It appears as if zn, nox, medv and tax are the most influential linear separators for the clusters. That is, the proportion of residential land allocated for large properties, air pollution, price level and property tax rate seem to be the variables most strongly influencing which cluster a particular area belongs to.

# reloading the Boston dataset once more
library(MASS)
data("Boston")
# standardizing the dataset
Boston <- as.data.frame(scale(Boston))
# performing k-means on the dataset
km <-kmeans(Boston, centers = 3)
# saving clusters to be used with the LDA
clusters <- km$cluster
# adding the new clusters variable to the set
Boston <- data.frame(Boston, clusters)
# dividing the dataset into train and test sets to be used with the LDA
n <- nrow(Boston)
ind <- sample(n,  size = n * 0.8)
train2 <- Boston[ind,]
# removing chas because I repeatedly received the error message "variable 4 appears to be constant within groups"
train2 <- dplyr::select(train2, -chas)
# performing the lda
lda.fit2 <- lda(clusters ~ ., data = train2)
# defining the arrows function
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
# plotting the lda results with arrows
plot(lda.fit2, dimen = 2, col=clusters, pch=clusters, ylim=c(-5,7),xlim=c(-5,5))
lda.arrows(lda.fit2, myscale = 4)

For the super-bonus section, I apply the given code to create a matrix product, a projection of the data points to be visualized. I draw two 3D plots, one where the colour is defined by the categorical crime classes, and another where it is defined by the k-means clusters. As can be seen, the shape of the two plots is identical. However, the colouring by categorical crime classes is much 'neater', helping the eye see which elements belong together. For instance, all yellow values are gathered at the left-hand side of the chart (with high x values). Turning and twisting the graphic with the mouse helps in understanding the data even better.

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
# accessing the plotly package in order to create a 3D plot of the columns of the matrix product. (I have already run install.packages("plotly") once.)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = ~classes)
# extracting the clusters from the second train set
cluscol <- train2$clusters
# drawing the second plot
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = ~cluscol)

RStudio Exercise 5

date()
## [1] "Mon Dec  6 00:50:34 2021"

The Human dataset - an overview

The human dataset contains country-specific indicators relating to the Human Development Index (HDI), provided by the United Nations Development Programme. The index measures achievements in dimensions of human development such as health and longevity, education and standard of living.

Variables included in this exercise indicate the ratio of women to men in secondary education and labour force, as well as expected length of education and life. Variables also include Gross National Income as a measure of standard of living, and maternal mortality and adolescent birth rate as measures of the level of health care provided to young women.

# reading the human data from my data folder
human <- read.csv("./data/human.csv", stringsAsFactors = F, row.names = 1)
# exploring the data structure
head(human)
##              EduRatio  LabRatio EducExp LifeExp   GNI Matmort AdolBirthRt
## Norway      1.0072389 0.8908297    17.5    81.6 64992       4         7.8
## Australia   0.9968288 0.8189415    20.2    82.4 42261       6        12.1
## Switzerland 0.9834369 0.8251001    15.8    83.0 56431       6         1.9
## Denmark     0.9886128 0.8840361    18.7    80.2 44025       5         5.1
## Netherlands 0.9690608 0.8286119    17.9    81.6 45435       6         6.2
## Germany     0.9927835 0.8072289    16.5    80.9 43919       7         3.8
##             ParlRep
## Norway         39.6
## Australia      30.5
## Switzerland    28.5
## Denmark        38.0
## Netherlands    36.9
## Germany        36.9
summary(human)
##     EduRatio         LabRatio         EducExp         LifeExp     
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Matmort        AdolBirthRt        ParlRep     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
# plotting pairs
library(GGally)
ggpairs(human)

Below is a corrplot diagram of the correlations between variables. Based on the data, it is interesting to note that parliament representation does not seem to be particularly strongly correlated with anything: at least in the corrplot diagram, no strong correlation can be discerned (the share of females in parliament is moderately correlated with the labour force ratio, expected length of education and life expectancy). Neither does the labour force ratio seem to be strongly correlated with anything.

# drawing a corrplot
library(corrplot)
cor(human) %>% corrplot()

Performing principal component analysis (PCA)

First, I will perform a principal component analysis (PCA) on the non-standardized human data, as instructed in the exercise. The variability captured by the principal components is given in the printout below. The plot highlights the impact of the Gross National Income (GNI) variable, as this was the variable containing the largest absolute numbers (maximum values over 100 times larger than in any other variable).

# perform principal component analysis (with the SVD method) on the human dataset in its *non-standardized* form
pca_human <- prcomp(human)
# exploring the variability captured by the principal components
summary(pca_human)$importance[2,]
##    PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8 
## 0.9999 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.6, 0.9), col = c("grey40", "deeppink2"), main = "Impact of GNI on HDI")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

Now, I will redo the PCA analysis with standardized variables.

# standardize variables
human_std <- scale(human)
# perform principal component analysis (with the SVD method) on the human dataset in its *standardized* form
pca_human_std <- prcomp(human_std)
# exploring the variability captured by the principal components
s <- summary(pca_human_std)$importance[2,]
s
##     PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8 
## 0.53605 0.16237 0.09571 0.07583 0.05477 0.03595 0.02634 0.01298
# save rounded percentages of variance captured by each PC (to be used in axis labels)
pca_pr <- round(s*100, digits = 1)
# create object pc_lab (to be used as axis labels)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot of the principal component representation and the original variables
biplot(pca_human_std, choices = 1:2, cex = c(0.6, 0.9), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], main = "Interconnected variables impacting HDI")

As was seen above, the first plot, drawn from the non-standardized data, was extremely hard to read, as almost all points in the scatter plot were pushed into a very crowded space in the upper-right corner. Because of the high absolute values of the GNI variable, its arrow is about the only one whose direction is visible in the first plot; I cannot even tell whether any arrows point in other directions (I suppose the arrows of ParlRep and LabRatio point upwards, but they are at least not properly visible).

The Human Development Index (HDI) is a summary measure of average achievement in key dimensions of human development: a long and healthy life, being knowledgeable, and having a decent standard of living, according to the UNDP.

From the latter plot we can discern three groups of variables that seem to influence the HDI value in different ways. Life expectancy, maternal mortality and adolescent birth rate all relate to living a long and healthy life. Life expectancy is negatively correlated with maternal mortality and adolescent birth rate, so these arrows pull the results in opposite directions.

The proportion of women in labour force and parliament (visible as arrows pointing upwards) can be interpreted as variables measuring women’s possibilities to participate in and impact society. The proportion of women in parliament is moderately correlated with the proportion of women in the labour force.

Interestingly, the ratio of women to men in secondary education is more strongly related to life expectancy than to the proportion of women in parliament or the ratio of women to men in the labour force. Being knowledgeable seems to be more strongly correlated with being able to live a long and healthy life than the ability to participate in working life and national politics.

Balancing national income and gender equality - interpreting the first two principal component dimensions of the PCA

The scatter plot produced by the PCA is a very interesting one. The GNI variable that stood out in the first plot is now almost entirely hidden behind the EduRatio variable (and is thus very strongly correlated with it). In a way, this is the key to understanding the horizontal x axis of the plot. On the left-hand side we find rich countries, European, Asian and Arab alike. On the right-hand side we find poorer countries, many of which have been hit by war and poverty and lack good public health services.

The vertical y axis is also interesting, although initially harder to label. I speculated over whether it was a conservative-liberal axis or a progressive-reactionary axis, before concluding that the y axis is all about gender equality. Let me explain why.

The angle between a variable and a PC axis can be interpreted as the correlation between the two. In the case of the x axis, the difference in angle is minimal for the variables at each horizontal end. As the angles between the ParlRep and LabRatio variables and the y axis are almost as small (although noticeably larger), one can understand the y axis as reflecting equality between women and men (in politics and the workforce), placing Rwanda at the top and hard-line Arab states at the bottom. (Rwanda's top placement may come as a surprise to some, but the power balance between men and women in Rwanda, and the attitudes and approaches chosen by Rwandan women, have been discussed in media articles, for example here.)

In principal component analysis, the first principal component captures the maximum amount of variance in the features of the original data. In this instance, PC1 (relating to life expectancy and health issues) explains just over half of the variance, while the gender equality aspects relating to female representation in the labour force and national politics account for only about a sixth of the variance between countries. (Note how one single variable, GNI, explained over 99 per cent of the variance in the misleading first plot.)
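The proportions discussed above are simply each component's squared standard deviation over the total variance. A minimal sketch on the built-in iris measurements (not the human data), showing where the "Proportion of Variance" row of the summary comes from:

```r
# PCA on the four numeric iris columns, standardized as in the analysis above
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# proportion of variance = squared sdev of each PC over the total
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
round(prop_var, 3)

# matches the "Proportion of Variance" row used earlier for the axis labels
all.equal(unname(summary(pca)$importance[2, ]), prop_var, tolerance = 1e-4)
```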

A note on the side: what the plots - and the data - say nothing about is the distribution of income and wealth within countries. Attempts to quantify inequality have at various times been made using, for instance, the Gini coefficient, although this measure has shortcomings of its own: it does not, for example, take into account the effect of income redistribution, differences in living costs between countries, or the distribution of wealth.
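To make the Gini coefficient concrete: it is the mean absolute difference between all pairs of incomes, divided by twice the mean income. A minimal sketch (the income vectors below are invented purely for illustration):

```r
# Gini coefficient: mean absolute pairwise difference over twice the mean.
gini <- function(x) {
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (2 * n^2 * mean(x))
}
gini(c(10, 10, 10, 10))  # perfect equality -> 0
gini(c(0, 0, 0, 100))    # one person holds everything -> 0.75, i.e. (n-1)/n
```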

Exploring tea drinking habits

The tea dataset of the FactoMineR package includes 300 observations of 36 variables relating to tea drinking habits. Most variables are factors with 2 levels (i.e. binary), but some have more. Initially, I will visualise the data by drawing bar plots of six of the most interesting variables. I will then conduct a Multiple Correspondence Analysis (MCA) on the tea data, draw a variable biplot, and offer an interpretation of the results.

# loading the tea dataset
library(FactoMineR) # FactoMineR was installed beforehand with install.packages("FactoMineR")
library(ggplot2)
library(dplyr)
library(tidyr)
data("tea")
# exploring the tea dataset
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 
?tea
dim(tea)
## [1] 300  36
# moving on to visualization. I choose to visualize six of the variables that I find most interesting.
keep_columns <- c("age_Q","frequency","How","price","SPC","Tea")
tea_visu <- select(tea, one_of(keep_columns))
gather(tea_visu) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

# conducting the Multiple Correspondence Analysis. I continue with my selection of six interesting variables.
mca <- MCA(tea_visu, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_visu, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.350   0.292   0.234   0.214   0.206   0.198   0.189
## % of var.              9.127   7.610   6.092   5.588   5.386   5.172   4.926
## Cumulative % of var.   9.127  16.737  22.830  28.417  33.803  38.975  43.901
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.181   0.180   0.178   0.170   0.166   0.152   0.149
## % of var.              4.730   4.705   4.636   4.431   4.334   3.977   3.897
## Cumulative % of var.  48.631  53.336  57.972  62.403  66.737  70.714  74.611
##                       Dim.15  Dim.16  Dim.17  Dim.18  Dim.19  Dim.20  Dim.21
## Variance               0.145   0.140   0.134   0.123   0.113   0.109   0.108
## % of var.              3.789   3.663   3.484   3.222   2.960   2.849   2.813
## Cumulative % of var.  78.400  82.064  85.547  88.769  91.729  94.577  97.390
##                       Dim.22  Dim.23
## Variance               0.055   0.046
## % of var.              1.423   1.188
## Cumulative % of var.  98.812 100.000
## 
## Individuals (the 10 first)
##                Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1           |  0.239  0.054  0.008 |  1.146  1.501  0.184 |  0.579  0.478
## 2           |  0.461  0.202  0.060 |  0.798  0.727  0.181 |  0.313  0.140
## 3           |  0.025  0.001  0.000 |  0.459  0.241  0.057 |  0.211  0.063
## 4           | -0.847  0.684  0.411 | -0.364  0.151  0.076 |  0.155  0.034
## 5           | -0.125  0.015  0.008 |  0.193  0.042  0.018 | -0.049  0.003
## 6           | -0.973  0.901  0.257 | -0.391  0.175  0.042 |  0.122  0.021
## 7           |  0.055  0.003  0.001 |  0.531  0.322  0.069 |  0.818  0.955
## 8           |  0.401  0.153  0.035 |  0.702  0.563  0.108 |  1.012  1.462
## 9           |  0.454  0.196  0.051 |  0.693  0.549  0.118 |  0.767  0.839
## 10          |  0.680  0.441  0.117 |  0.520  0.309  0.069 |  0.911  1.185
##               cos2  
## 1            0.047 |
## 2            0.028 |
## 3            0.012 |
## 4            0.014 |
## 5            0.001 |
## 6            0.004 |
## 7            0.163 |
## 8            0.225 |
## 9            0.144 |
## 10           0.210 |
## 
## Categories (the 10 first)
##                 Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2  v.test
## 15-24       |  -1.028  15.424   0.467 -11.817 |  -0.722   9.133   0.231  -8.303
## 25-34       |  -0.194   0.412   0.011  -1.834 |   0.556   4.059   0.092   5.252
## 35-44       |   0.438   1.220   0.030   2.972 |   0.967   7.123   0.144   6.558
## 45-59       |   0.410   1.627   0.043   3.580 |   0.732   6.227   0.137   6.396
## +60         |   1.721  17.867   0.429  11.332 |  -1.454  15.306   0.307  -9.577
## 1/day       |  -0.080   0.098   0.003  -0.947 |   0.450   3.670   0.094   5.302
## 1 to 2/week |  -0.320   0.717   0.018  -2.297 |  -0.096   0.077   0.002  -0.686
## +2/day      |   0.149   0.446   0.016   2.204 |  -0.262   1.658   0.050  -3.879
## 3 to 6/week |   0.084   0.038   0.001   0.519 |  -0.157   0.159   0.003  -0.969
## alone       |  -0.217   1.454   0.087  -5.107 |  -0.080   0.237   0.012  -1.882
##                 Dim.3     ctr    cos2  v.test  
## 15-24       |   0.285   1.775   0.036   3.275 |
## 25-34       |  -0.740   8.994   0.164  -6.995 |
## 35-44       |   0.865   7.127   0.115   5.870 |
## 45-59       |   0.058   0.048   0.001   0.502 |
## +60         |  -0.349   1.099   0.018  -2.296 |
## 1/day       |  -0.644   9.372   0.192  -7.580 |
## 1 to 2/week |   0.576   3.468   0.057   4.127 |
## +2/day      |   0.175   0.926   0.023   2.594 |
## 3 to 6/week |   0.400   1.297   0.020   2.475 |
## alone       |  -0.012   0.006   0.000  -0.271 |
## 
## Categorical variables (eta2)
##               Dim.1 Dim.2 Dim.3  
## age_Q       | 0.767 0.732 0.267 |
## frequency   | 0.027 0.097 0.211 |
## How         | 0.156 0.074 0.139 |
## price       | 0.162 0.097 0.207 |
## SPC         | 0.677 0.745 0.327 |
## Tea         | 0.310 0.005 0.250 |
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")

As the distance between variable categories in an MCA provides a measure of their similarity, this gives me some hints as to how to interpret the tea drinking habits of different population groups. The pink colour indicates occupation, black indicates age, blue indicates the price category, red indicates tea drinking frequency, brown indicates the kind of tea (black, Earl Grey or green) and green indicates the way of drinking it (e.g. with milk or lemon).

Heavily stereotyping, one could suggest that in the bottom-right corner we find non-workers and seniors who might have an occasional cup of some cheap label every now and then. In the bottom-left corner, we find young students drinking private label teas, some of whom drink tea fairly often, even twice a day. In the top-left corner, we find workmen and employees having an occasional cup once a week, or a regular cup once a day, of some unknown or variable brand. In the top-right corner, we find the more posh, middle- and upper-class drinkers who do not necessarily drink tea that regularly (perhaps they prefer coffee), but who are presumably picky about their drink (the branded and upscale tea categories are found here). Also, the closer we get to the middle and senior occupational groups, the more green tea is consumed.

It’s worth noting that the two dimensions plotted in the MCA factor map do not account for much of the variation - in fact, less than ten per cent each. In other words, there are other dimensions that jointly explain much more of the variation in tea drinking habits.
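The percentages in the summary can be recovered by hand: in an MCA, the total inertia equals (J - Q)/Q, where J is the total number of categories and Q the number of variables. A quick check against the Dim.1 figure printed above (the eigenvalue 0.350 is rounded in the summary, hence the small discrepancy):

```r
# Total inertia in an MCA of Q variables with J categories in total
# is (J - Q) / Q. Here: 5+4+4+6+7+3 = 29 categories over 6 variables.
J <- 29
Q <- 6
total_inertia <- (J - Q) / Q            # 23/6, i.e. 23 dimensions
round(100 * 0.350 / total_inertia, 2)   # Dim.1: ~9.13 % of variance
```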